03. More on the Policy
In the previous video, you learned how the agent could use a simple neural network architecture to approximate a stochastic policy. The agent passes the current environment state as input to the network, which returns action probabilities. Then, the agent samples from those probabilities to select an action.
![Neural network that encodes action probabilities](img/screen-shot-2018-07-01-at-10.54.05-am.png)
Neural network that encodes action probabilities (Source)
The same neural network architecture can be used to approximate a deterministic policy. Instead of sampling from the action probabilities, the agent need only choose the greedy action.
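The two ways of using the same network can be sketched in a few lines. This is a minimal numpy illustration, not a trained agent: the weights `W` and `b` are random placeholders standing in for a learned single-layer policy network, and the state is an arbitrary example CartPole observation.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)

# Hypothetical weights: 4 state entries (CartPole observation) -> 2 actions
W = rng.normal(size=(4, 2))
b = np.zeros(2)

state = np.array([0.02, -0.01, 0.03, 0.04])  # example CartPole state
probs = softmax(state @ W + b)               # action probabilities (sum to 1)

# Stochastic policy: sample an action from the probabilities
stochastic_action = rng.choice(len(probs), p=probs)

# Deterministic policy: always choose the greedy (most probable) action
greedy_action = int(np.argmax(probs))
```

The only difference between the two policies is the final step: sampling versus `argmax` over the same network output.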
## Quiz

SOLUTION: softmax

## What about continuous action spaces?
The CartPole environment has a discrete action space. So, how do we use a neural network to approximate a policy, if the environment has a continuous action space?
As you learned above, in the case of discrete action spaces, the neural network has one node for each possible action.
For continuous action spaces, the neural network has one node for each action entry (or index). For example, consider the action space of the bipedal walker environment, shown in the figure below.
![Action space of BipedalWalker-v2](img/screen-shot-2018-07-01-at-11.28.57-am.png)
Action space of BipedalWalker-v2 (Source)
In this case, any action is a vector of four numbers, so the output layer of the policy network will have four nodes.
Since every entry in the action must be a number between -1 and 1, we will add a tanh activation function to the output layer.
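A sketch of such an output layer, again with placeholder random weights rather than a trained network: the 24-entry state vector matches the BipedalWalker-v2 observation, and `tanh` squashes each of the four output nodes into the required (-1, 1) range.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weights: 24 state entries (BipedalWalker-v2 observation)
# -> 4 action entries (one output node per entry of the action vector)
W = rng.normal(size=(24, 4))
b = np.zeros(4)

state = rng.normal(size=24)      # example state
action = np.tanh(state @ W + b)  # tanh keeps every entry in (-1, 1)
```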
As another example, consider the continuous mountain car benchmark. The action space is shown in the figure below. Note that for this environment, the action must be a value between -1 and 1.
![Action space of MountainCarContinuous-v0](img/screen-shot-2018-07-01-at-11.19.22-am.png)
Action space of MountainCarContinuous-v0 (Source)
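For this environment the same recipe applies with a single output node. A minimal sketch with placeholder weights, assuming the 2-entry MountainCarContinuous-v0 observation (position and velocity):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weights: 2 state entries -> 1 action entry
W = rng.normal(size=(2, 1))
b = np.zeros(1)

state = np.array([-0.5, 0.0])    # example state: position, velocity
action = np.tanh(state @ W + b)  # a single value in (-1, 1)
```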